Differentially Private Distributed Online Learning
Abstract: In this paper, we propose a novel distributed online learning algorithm to handle massive data in the Big Data era.
Compared with the typical centralized scenario, our distributed online learning setting has multiple learners. Each learner
optimizes its own learning parameter based on its local data source and communicates with its neighbors in a timely manner.
We study the regret of the distributed online learning algorithm. However, communication among the learners may lead to
privacy breaches. Thus, we use differential privacy to preserve the privacy of the learners, and study the influence of
guaranteeing differential privacy on the regret of the algorithm. Furthermore, our online learning algorithm can be used to
achieve fast convergence rates for offline learning algorithms in distributed scenarios. We demonstrate that the differentially
private offline learning algorithm has high variance, but that mini-batching can improve its performance. The simulations
support our theoretical results and show that our differentially private distributed online learning algorithm is a general
framework.

Index Terms: Distributed Optimization, Online Learning, Differential Privacy, Offline Learning, Mini-batch
1 Introduction

As the Internet develops rapidly, more and more information
is put online. For example, tens of millions of people on
Facebook share photos on their personal pages and post
stories from daily life in comments, so Facebook must
process a large volume of data every second. Processing
data at this scale efficiently is a challenging issue. In
addition, as an online interaction platform, the Internet
should offer people real-time service. Internet companies
(e.g., Google, Facebook and YouTube) therefore have to
respond and update their systems in real time. To provide
better services, they need to learn and predict user behavior
based on users' past information. Hence, the notion of
"online learning" was introduced by researchers. In the early
stages, most online learning algorithms proceeded in a
centralized fashion. However, as the data volume grows
exponentially in the Big Data era, typical centralized online
learning algorithms are no longer capable of processing such
large-scale and high-rate online data. Besides, online data
collection is inherently decentralized because data sources
are often widely distributed across different geographical
locations. It is therefore much more natural to develop a
distributed online learning algorithm (DOLA) to solve the
problem.

During the learning process, sharing information may
lead to privacy breaches. For instance, suppose the hospitals
in a city want to conduct a survey (which can be regarded as
a learning process) of the diseases that citizens are
susceptible to. To protect patients' sensitive information,
the hospitals obviously cannot release their cases of illness.
Instead, each hospital can share only limited information
with the other hospitals. However, different patient samples
lead to different results. By analyzing the results, an
adversary is able to obtain sensitive information about
certain patients whose cases are included in only one
hospital. Faced with this kind of privacy breach, the problem
is how to preserve the privacy of the survey participants
without significantly affecting the accuracy of the survey.
To solve this class of problems, we need a privacy-preserving
algorithm that not only processes distributed online learning
effectively but also protects the privacy of the learners.

In this paper, we propose a differentially private
distributed online learning algorithm with decentralized
learners and data sources. The algorithm addresses two issues:
1) distributed online learning; 2) privacy-preserving
guarantees. Specifically, we use distributed convex
optimization as the distributed online learning model, and
use differential privacy [1] to protect privacy.

Distributed convex optimization is considered as a
consensus problem [2]. To solve this problem, some related
works have been done. These papers considered a multi-agent
network system, where they studied distributed convex
optimization for minimizing a sum of convex objective
functions. For the convergence of their algorithms, each
agent updates its iterates with a usual convex optimization
method and communicates the iterates to its neighbors. To
achieve this goal, a time-variant communication matrix is
used to conduct the communications among the agents. The
time-variant communication matrix makes the distributed
optimization algorithm converge faster and better than the
fixed one used in |^. For our work, the first issue is how the
DOLA performs compared with the centralized algorithm.
To this end, we use some results of the above works to
compute the regret bounds of our DOLA.

Differential privacy [1] is a popular privacy mechanism
for preserving the privacy of the learners, and a lot of
progress has been made on it. This mechanism prevents the
adversary from gaining any meaningful information about
any individual. This privacy-preserving method is scalable
to large and dynamic datasets. Specifically, it provides a
rigorous and quantitative account of the risk of a privacy
breach in statistical learning algorithms. Many
privacy-preserving algorithms have been proposed to use
differential privacy to protect sensitive information in the
centralized offline learning framework. However, little
research effort has been devoted to the distributed learning
framework.

Furthermore, our differentially private DOLA can be
used to achieve fast convergence rates for a differentially
private distributed offline learning algorithm based on [ 1^ .
Since the offline learning algorithm has access to all data, the
technique of mini-batch 11 111 is used to reduce the high
variance of the differentially private offline learning algorithm.
Motivated by and IHTI , we try to obtain good utility for
the distributed offline learning algorithm while protecting the
privacy of the learners. More importantly, our differentially
private distributed offline learning algorithm guarantees the
same level of privacy as the DOLA with less random noise
and achieves a fast convergence rate.

The main contributions of this paper are as follows:

• We present a DOLA (i.e., Algorithm 1), where each
learner updates its learning parameter based on its
local data source and exchanges information with its
neighbors. For this algorithm we obtain the classical
regret bounds $O(\sqrt{T})$ (l2 ll and $O(\log T)$ fl^ for
convex and strongly convex objective functions, respectively.

• To protect the privacy of learners, we make our
DOLA guarantee ε-differential privacy. Interestingly,
we find that the private regret bounds have the same
order, $O(\sqrt{T})$ and $O(\log T)$, as the non-private
ones, which indicates that guaranteeing differential
privacy in the DOLA does not significantly hurt the
original performance.

• We use the differentially private DOLA with good
regret bounds to solve differentially private distributed
offline learning problems (i.e., Algorithm 2) for the
first time. Algorithm 2 has tighter utility guarantees
than the existing state-of-the-art results while
guaranteeing ε-differential privacy.

• We use mini-batch to reduce the high variance of the
differentially private distributed offline learning
algorithm and demonstrate that the algorithm using
mini-batch guarantees the same level of privacy with
less noise.

The rest of the paper is organized as follows. Section 2
discusses related work. Section 3 presents preliminaries
for formal distributed online learning. Section 4
proposes the differentially private distributed online
learning algorithm. We discuss the privacy analysis of our DOLA
in Section 4.1 and the regret bounds in Section 4.2.
In Section 5, we discuss the application of the DOLA to the
differentially private distributed offline learning algorithm.
Sections 5.1 and 5.2 discuss the privacy and the regret,
respectively. In Section 6, we present simulation results of the
proposed algorithms. Finally, Section 7 concludes the paper.

2 Related Work 

Jain et al. [3] studied differentially private centralized
online learning. They provided a generic differentially
private framework for online algorithms. They showed
that, using their generic framework, Implicit Gradient
Descent (IGD) and Generalized Infinitesimal Gradient Ascent
(GIGA) can be transformed into differentially private online
learning algorithms. Their work motivates our study of
differentially private online learning in distributed
scenarios.

Recently, growing research effort has been devoted to
distributed online learning. Yan et al. @] proposed a
DOLA to handle decentralized data. A fixed network
topology was used to conduct the communications among
the learners in their system. They analyzed the regret
bounds for convex and strongly convex functions,
respectively. Further, they studied the privacy-preserving
problem and showed that the communication network gave
their algorithm intrinsic privacy-preserving properties.
Unlike differential privacy, however, their method cannot
protect the privacy of all learners absolutely, because its
privacy-preserving properties depend on the connectivity
between two nodes, and not all nodes can have the same
connectivity in a fixed communication matrix. Besides,
Huang et al. 1141] is closely related to our work. In their
paper, they presented a differentially private distributed
optimization algorithm. While guaranteeing the convergence
of the algorithm, they used differential privacy to protect
the privacy of the agents. Finally, they observed that to
guarantee ε-differential privacy, their algorithm had
accuracy of the order of 0(^). Compared with this
accuracy, we obtain not only $O(1/\sqrt{T})$ rates for convex
functions but also $O(\log T / T)$ rates for strongly convex
functions, if the regret bounds of our differentially private
DOLA are converted to convergence rates.

The method for solving distributed online learning was
pioneered in distributed optimization. Hazan studied
online convex optimization in his book Hi , proposing
that the framework of convex online learning is closely
tied to statistical learning theory and convex optimization.
Duchi et al. m developed an efficient algorithm for
distributed optimization based on dual averaging of
subgradients. They demonstrated that their algorithm
works even when the communication matrix is random rather
than fixed. Nedic and Ozdaglar jj] considered a subgradient
method for distributed convex optimization, where
the functions are convex but not necessarily smooth. They
demonstrated that time-variant communication can ensure
the convergence of the distributed optimization algorithm.
Ram et al. analyzed the influence of stochastic subgradient
errors on distributed convex optimization based on a
time-variant network topology, and studied the convergence
rate of their distributed optimization algorithm. Our work
extends the works of Nedic and Ozdaglar
0] and Ram et al. 0]. All these papers have made great
contributions to distributed convex optimization, but they
did not consider the privacy-preserving problem.

As for the study of differential privacy, much research
effort has been devoted to how differential privacy can be
used in existing learning algorithms. For example,
Chaudhuri et al. 0] presented the output perturbation
and objective perturbation ideas for differential privacy
in empirical risk minimization (ERM) classification. They
achieved good utility for the ERM algorithm while guaranteeing
ε-differential privacy. Rajkumar and Agarwal fill extended
differentially private ERM classification @] to differentially
private ERM multiparty classification. More importantly,
they analyzed the sequential and parallel composability
problems while the algorithm guaranteed ε-differential
privacy. Bassily et al. Iisl] proposed more efficient algorithms
and tighter error bounds for ERM classification on the basis
of i).

Some papers have discussed the application of online
learning with good regret to offline learning. Kakade and
Tewari o proposed some properties of online learning
algorithms when the loss function is Lipschitz and strongly
convex. They found that recent online algorithms with
logarithmic regret guarantees could help to achieve fast
convergence rates for the excess risk with high probability.
Subsequently, Jain et al. @] used the results in lldl to analyze
the utility of differentially private offline learning algorithms.

3 Preliminaries 

Notation: Upper case letters (e.g., A or W) denote matrices
or data sets, while lower case letters (e.g., a or w) denote
elements of matrices or column vectors. For instance, we
denote the i-th learner's parameter vector at time t by $w_t^i$.
$w[j]$ denotes the j-th component of a vector w of length N.
$a_{ij}$ denotes the (i, j)-th element of A. Unless otherwise remarked,
$\|\cdot\|$ denotes the Euclidean norm $\|w\| := \sqrt{\sum_i w[i]^2}$ and
$\langle \cdot, \cdot \rangle$ denotes the inner product $\langle x, y \rangle = x^T y$. $\alpha_t$ denotes
the stepsize.

Centralized Online Learning: Given the correct results
of its previous predictions, online learning aims at making
a sequence of predictions. Online learning algorithms
proceed in rounds. At round t, the learner gets a question
$x_t$, taken from a convex set X, and should give an answer
denoted by $p_t$ to this question. Finally, the correct answer
$y_t$ is given to be compared with $p_t$. Specifically, in online
regression problems, $x_t$ denotes a vector of features, then
$p_t \leftarrow \langle w_t, x_t \rangle$ is a sequence of linear predictions, and
comparing $p_t$ with $y_t$ leads to the loss function
$\ell(w_t, x_t, y_t)$ (e.g., $\ell(w_t, x_t, y_t) = |\langle w_t, x_t \rangle - y_t|$).
We let $f_t(w) := \ell(w, x_t, y_t)$, which is obviously a convex
function. According to the definition of online learning
regret, the goal of the online learning model is to minimize
the function:

$$R_C = \sum_{t=1}^{T} f_t(w_t) - \min_{w \in W} \sum_{t=1}^{T} f_t(w), \tag{1}$$

where $W \subseteq \mathbb{R}^n$.
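To make the regret in (1) concrete, the following minimal sketch runs projected online subgradient descent on the absolute loss mentioned above and measures the regret against the best fixed parameter. It is only an illustration under stated assumptions: the one-dimensional setting, the interval W = [-radius, radius], the stepsize D/(G sqrt(t)), the synthetic data, and the grid search for the comparator are all choices made here, not taken from the paper.

```python
import numpy as np

def online_subgradient_regret(xs, ys, radius=1.0):
    """Regret of projected online subgradient descent on f_t(w) = |w*x_t - y_t|.

    Assumptions of this sketch: scalar w, W = [-radius, radius], and the
    textbook stepsize D / (G * sqrt(t)) with D the diameter of W and G a
    bound on the subgradient magnitude.
    """
    T = len(xs)
    G = np.max(np.abs(xs))            # |subgradient| = |x_t| <= G
    D = 2.0 * radius                  # diameter of W
    w = 0.0
    losses = np.empty(T)
    for t in range(T):
        residual = w * xs[t] - ys[t]
        losses[t] = abs(residual)     # loss is revealed after predicting
        g = np.sign(residual) * xs[t] # a subgradient of f_t at w
        step = D / (G * np.sqrt(t + 1))
        w = float(np.clip(w - step * g, -radius, radius))  # projection onto W
    # Best fixed w in hindsight, approximated on a grid (an assumption).
    grid = np.linspace(-radius, radius, 2001)
    comparator = np.abs(np.outer(grid, xs) - ys).sum(axis=1).min()
    return losses.sum() - comparator, G, D

rng = np.random.default_rng(0)
xs = rng.uniform(-1.0, 1.0, size=500)
ys = 0.5 * xs + 0.05 * rng.standard_normal(500)
regret, G, D = online_subgradient_regret(xs, ys)
print(f"regret after {len(xs)} rounds: {regret:.3f}")
```

With this stepsize the regret grows no faster than a constant times sqrt(T), which is the convex-case order the paper refers to.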

In this paper, distributed online learning model is devel¬ 
oped on the basis of the above description. 

Distributed Convex Optimization: Besides basic
assumptions on datasets and objective functions, how the
communications among the distributed learners are conducted
is critical to solving the distributed convex optimization
problem in our work. Since the learners exchange
information with neighbors while they update local parameters
with subgradients, a time-variant m-by-m doubly stochastic
matrix $A_t$ is proposed to conduct the communications. $A_t$
has a few properties: 1) all elements of $A_t$ are non-negative
and the sum of each row or column is one; 2) $a_{ij}(t) > 0$
means there exists a communication between the i-th and
j-th learners at round t, while $a_{ij}(t) = 0$ means no
communication between them; 3) there exists a constant $\eta$,
$0 < \eta < 1$, such that $a_{ij}(t) > 0$ implies that $a_{ij}(t) \geq \eta$.

For distributed convex optimization, two assumptions
must be made. First, we make the following assumption on
the dataset W and the cost functions $f_t^i$.

Assumption 1. The set W and the cost functions $f_t^i$ are
such that

(1) The set W is a closed and convex subset of $\mathbb{R}^n$. Let
$R = \sup_{x, y \in W} \|x - y\|$ denote the diameter of W.

(2) The cost functions $f_t^i$ are strongly convex with
modulus $\lambda > 0$. For all $x, y \in W$, we have

$$\langle \nabla f_t^i, y - x \rangle \leq f_t^i(y) - f_t^i(x) - \frac{\lambda}{2} \|y - x\|^2. \tag{2}$$

(3) The subgradients of $f_t^i$ are uniformly bounded, i.e.,
there exists $L > 0$ such that, for all $x \in W$, we have

$$\|\nabla f_t^i(x)\| \leq L. \tag{3}$$

Assumption (1) guarantees that there exists an optimal
solution in our algorithm. Assumptions (2) and (3) help us
analyze the convergence of our algorithm.

To recall, the learners communicate with neighbors
based on the matrix $A_t$. Each learner directly or indirectly
influences other learners. For a clear description, we denote
the communication graph for a learner i at round t by

$$\mathcal{G}(t)_i = \{(i, j) : a_{ij}(t) > 0\}, \tag{4}$$

where $a_{ij}(t) \in A_t$.

In our algorithm, each learner computes a weighted average
of the m learners' parameters. For the convergence of
the DOLA, the weighted average should make each learner
have "equal" influence on other learners over many rounds.
We therefore make the following assumption about the
properties of $A_t$.

Assumption 2. For an arbitrary learner i, there exist a
minimal scalar $\eta$, $0 < \eta < 1$, and a scalar N such that

(1) $a_{ij}(t) > 0$ for $(i, j) \in \mathcal{G}(t + 1)_i$,

(2) $\sum_{j=1}^{m} a_{ij}(t) = 1$ and $\sum_{i=1}^{m} a_{ij}(t) = 1$,

(3) $a_{ij}(t) > 0$ implies that $a_{ij}(t + 1) \geq \eta$,

(4) The graph $\bigcup_{k=1,\ldots,N} \mathcal{G}(t + k)_i$ is strongly connected
for all k.

Here, Assumptions (1) and (2) state that each learner
computes a weighted average of the parameters shown in
Algorithm 1. Assumption (3) ensures that the influences
among the learners are significant. Assumptions (2) and (4)
ensure that the m learners are equally influential in the long
run. Assumption 2 is crucial to minimize the regret bounds
in distributed scenarios.
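The role of Assumption 2 can be illustrated numerically. The sketch below builds a time-variant doubly stochastic matrix for a hypothetical ring of learners that average in pairs, alternating between even and odd edges, and checks that the product of these matrices approaches the uniform averaging matrix. The ring topology, the pairing schedule and the weight 1/2 are assumptions chosen for illustration; the paper does not prescribe them.

```python
import numpy as np

def gossip_matrix(m, t):
    """Doubly stochastic A_t for a ring of m learners (m even, an assumption).

    On even rounds the pairs (0,1), (2,3), ... average their parameters;
    on odd rounds the pairs (1,2), (3,4), ..., (m-1,0) do. Every non-zero
    entry is at least eta = 0.5.
    """
    A = np.eye(m)
    start = t % 2
    for i in range(start, start + m, 2):
        a, b = i % m, (i + 1) % m
        A[[a, b], :] = 0.0                       # clear both rows
        A[a, a] = A[a, b] = A[b, a] = A[b, b] = 0.5
    return A

m = 6
A = gossip_matrix(m, 3)
row_sums, col_sums = A.sum(axis=1), A.sum(axis=0)

# Product A(40) A(39) ... A(1): each learner ends up weighting all others equally.
phi = np.eye(m)
for t in range(1, 41):
    phi = gossip_matrix(m, t) @ phi
max_deviation = np.abs(phi - 1.0 / m).max()
print("largest deviation from 1/m:", max_deviation)
```

No single round connects all learners, but the union of two consecutive rounds forms a connected ring, which is the kind of joint connectivity that condition (4) asks for.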

Differential Privacy: Dwork [1] proposed the definition
of differential privacy for the first time. Differential privacy
enables a data miner to release some statistic of
its database without revealing sensitive information about
a particular value itself. In this paper, we use differential
privacy to protect the privacy of learners and give the
following definition.

Definition 1. Let $\mathcal{A}$ denote our differentially private
DOLA. Let $X = \langle x_1^i, x_2^i, \ldots, x_T^i \rangle$ be a sequence of
questions taken from an arbitrary learner's local data source.
Let $W = \langle w_1^i, w_2^i, \ldots, w_T^i \rangle$ be a sequence of T outputs of
the learner and $W = \mathcal{A}(X)$. Then, our algorithm $\mathcal{A}$ is
ε-differentially private if, given any two adjacent question
sequences X and X' that differ in one question entry, the
following holds:

$$\Pr[\mathcal{A}(X) \in W] \leq e^{\epsilon} \Pr[\mathcal{A}(X') \in W]. \tag{5}$$

This inequality guarantees that whether or not an
individual participates in the database makes no
significant difference to the output of our algorithm, so the
adversary is not able to gain useful information about the
individual.

4 Differentially Private Distributed Online Learning

For differentially private distributed online learning, we
assume a system of m online learners, each of which has an
independent learning ability. The i-th learner updates
its local parameter $w_t^i$ based on its local data points $(x_t^i, y_t^i)$
with $i \in [1, m]$. The learner makes the prediction $\langle w_t^i, x_t^i \rangle$
at round t, then the loss function $f_t^i(w) := \ell(w, x_t^i, y_t^i)$ is
obtained. Even though the m learners are distributed, each
learner exchanges information with neighbors. Based on
the time-variant matrix $A_t$, the learners communicate with
different sets of their neighbors at different rounds, which
makes them indirectly influenced by other data sources.
Specifically, at each round t, a learner i first gets the
exchanged parameters and computes their weighted average,
then updates the local parameter $w_t^i$ with respect
to the weighted average $b_t^i$ and the subgradient $g_t^i$, and finally
broadcasts the new local parameter, with random noise
added, to its neighbors $\mathcal{G}(t)_i$. We summarize the algorithm in
Algorithm 1.

Before we discuss the privacy and utility of Algorithm
1, we define the regret in the distributed setting.

Definition 2. In an online learning algorithm, we assume
there are m learners using local data sources. Each learner
updates its parameter through a weighted average of the
received parameters. Then, we measure the regret of the
algorithm as

$$R_D = \sum_{t=1}^{T} \sum_{i=1}^{m} f_t^i(w_t^j) - \min_{w \in W} \sum_{t=1}^{T} \sum_{i=1}^{m} f_t^i(w). \tag{6}$$

Obviously, $f_t(w_t)$ in (1) is changed to the sum of the m
learners' loss functions $\sum_{i=1}^{m} f_t^i(w_t^j)$ in (6). In a centralized
online learning algorithm, N data points need T = N
rounds to be processed, while the distributed algorithm can
handle m × N data points over the same time period. Notice
that $R_D$ is computed with respect to an arbitrary learner's
parameter $w_t^j$ (^. This states that a single learner can
measure the regret of the whole system based on its local
parameter, even though the learner does not handle all data
in the system.

Next, we analyze the privacy of Algorithm 1 in Section
4.1 and give the regret bounds in Section 4.2.


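Algorithm 1 Differentially Private Distributed Online
Learning

1: Input: Cost functions $f_t^i(w) := \ell(w, x_t^i, y_t^i)$, $i \in [1, m]$
   and $t \in [0, T]$; initial points $w_0^1, \ldots, w_0^m$; doubly stochastic
   matrix $A_t = (a_{ij}(t)) \in \mathbb{R}^{m \times m}$; maximum iterations T.
2: for t = 0, ..., T do
3:   for each learner i = 1, ..., m do
4:     $b_t^i = \sum_{j=1}^{m} a_{ij}(t + 1)(w_t^j + \sigma_t^j)$, where $\sigma_t^j$ is a Laplace
       noise vector in $\mathbb{R}^n$
5:     $g_t^i \in \partial f_t^i$ (a subgradient of the local cost function)
6:     $w_{t+1}^i = \mathrm{Pro}[b_t^i - \alpha_{t+1} \cdot g_t^i]$ (projection onto W)
7:     broadcast the output $(w_{t+1}^i + \sigma_{t+1}^i)$ to $\mathcal{G}(t)_i$
8:   end for
9: end for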
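The steps of Algorithm 1 can be sketched in a few lines of NumPy. This is a rough illustration, not the authors' implementation. The following are assumptions made here: the absolute loss from Section 3, a Euclidean ball as the set W, the stepsize 1/sqrt(t), evaluating the subgradient at the weighted average, a fully connected averaging matrix in the usage example, and a Laplace scale obtained from the sensitivity bound derived in Section 4.1. The names `project_ball`, `private_dola` and `A_of_t` are hypothetical.

```python
import numpy as np

def project_ball(w, radius):
    """Euclidean projection onto W = {w : ||w|| <= radius} (assumed shape of W)."""
    norm = np.linalg.norm(w)
    return w if norm <= radius else w * (radius / norm)

def private_dola(X, y, A_of_t, eps, L, radius=1.0, seed=0):
    """Sketch of Algorithm 1. X: (T, m, n) features, y: (T, m) targets.

    A_of_t(t) must return an m-by-m doubly stochastic matrix. L bounds the
    subgradient norm, so the rows of X are assumed to satisfy ||x|| <= L.
    """
    rng = np.random.default_rng(seed)
    T, m, n = X.shape
    w = np.zeros((m, n))                 # local parameters w_t^i
    sigma = np.zeros((m, n))             # noise attached to broadcast parameters
    history = np.empty((T, m, n))
    for t in range(T):
        alpha_next = 1.0 / np.sqrt(t + 1)             # stepsize alpha_{t+1} (assumption)
        b = A_of_t(t + 1) @ (w + sigma)               # step 4: weighted average
        residual = np.einsum("ij,ij->i", b, X[t]) - y[t]
        g = np.sign(residual)[:, None] * X[t]         # step 5: subgradient of |<b,x> - y|
        w = np.array([project_ball(b[i] - alpha_next * g[i], radius)
                      for i in range(m)])             # step 6: projected update
        scale = 2.0 * alpha_next * np.sqrt(n) * L / eps
        sigma = rng.laplace(0.0, scale, size=(m, n))  # step 7: noise for the broadcast
        history[t] = w
    return history

# Usage on synthetic data with a fully connected averaging matrix.
T, m, n = 50, 4, 3
data_rng = np.random.default_rng(1)
X = data_rng.standard_normal((T, m, n))
X /= np.maximum(1.0, np.linalg.norm(X, axis=2, keepdims=True))   # enforce ||x|| <= 1
y = X @ np.array([0.3, -0.2, 0.1]) + 0.01 * data_rng.standard_normal((T, m))
history = private_dola(X, y, lambda t: np.full((m, m), 1.0 / m), eps=1.0, L=1.0)
```

Only the noisy values `w + sigma` ever leave a learner; the un-noised `w` stays local, which is the point of adding the noise at the broadcast step.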


4.1 Privacy Analysis 

As explained previously, exchanging information may cause
privacy breaches, so we have to use differential privacy
to protect privacy. In Algorithm 1, all
learners exchange their weighted parameters with neighbors
at each round. To preserve privacy, every exchanged
parameter should be made to guarantee differential privacy.
To achieve this target, random noise is added to the
parameter $w_t^i$ (see step 7 in Algorithm 1). This method of
guaranteeing differential privacy is known as output
perturbation @]. Knowing where to add noise, we next study
how much noise should be added.

Differential privacy aims at weakening the significant
difference between $\mathcal{A}(X)$ and $\mathcal{A}(X')$. Thus, to show
differential privacy, we need to know how "sensitive" the
algorithm $\mathcal{A}$ is. Further, according to (I|], the magnitude of
the noise depends on the largest change that a single entry
in the data source could have on the output of Algorithm 1;
this quantity is referred to as the sensitivity of the algorithm.
We define the sensitivity of Algorithm 1 in the following
definition.

Definition 3 (Sensitivity). Recall that in Definition 1, for any
X and X', which differ in exactly one entry, we define the
sensitivity of Algorithm 1 at the t-th round as

$$S(t) = \sup_{X, X'} \|\mathcal{A}(X) - \mathcal{A}(X')\|_1. \tag{7}$$

The above norm is the $L_1$-norm. According to the notion
of sensitivity, we know that higher sensitivity leads to
more noise if the algorithm guarantees the same level of
privacy. By bounding the sensitivity $S(t)$, we determine the
magnitude of the random noise needed to guarantee
ε-differential privacy. We compute the bound of $S(t)$ in the
following lemma.

Lemma 1. Under Assumption 1, if the $L_1$-sensitivity of
the algorithm is computed as (7), we obtain

$$S(t) \leq 2 \alpha_t \sqrt{n} L, \tag{8}$$

where n denotes the dimensionality of vectors.

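Proof. Recall that in Definition 1, X and X' are any two data sets
differing in one entry. $w_t^i$ is computed based on the data set
X while $w_t^{i\prime}$ is computed based on the data set X'. Certainly,
we have $\|\mathcal{A}(X) - \mathcal{A}(X')\|_1 = \|w_t^i - w_t^{i\prime}\|_1$.

For datasets X and X' we have

$$w_t^i = \mathrm{Pro}[b_{t-1}^i - \alpha_t g_{t-1}^i] \quad \text{and} \quad w_t^{i\prime} = \mathrm{Pro}[b_{t-1}^i - \alpha_t g_{t-1}^{i\prime}].$$

Then, we have

$$\begin{aligned}
\|w_t^i - w_t^{i\prime}\|_1 &= \|\mathrm{Pro}[b_{t-1}^i - \alpha_t g_{t-1}^i] - \mathrm{Pro}[b_{t-1}^i - \alpha_t g_{t-1}^{i\prime}]\|_1 \\
&\leq \|(b_{t-1}^i - \alpha_t g_{t-1}^i) - (b_{t-1}^i - \alpha_t g_{t-1}^{i\prime})\|_1 \\
&= \alpha_t \|g_{t-1}^i - g_{t-1}^{i\prime}\|_1 \\
&\leq \alpha_t \left( \|g_{t-1}^i\|_1 + \|g_{t-1}^{i\prime}\|_1 \right) \\
&\leq \alpha_t \sqrt{n} \left( \|g_{t-1}^i\|_2 + \|g_{t-1}^{i\prime}\|_2 \right) \\
&\leq 2 \alpha_t \sqrt{n} L.
\end{aligned} \tag{9}$$

By Definition 3, we know

$$S(t) \leq \|w_t^i - w_t^{i\prime}\|_1. \tag{10}$$

Hence, combining (9) and (10), we obtain (8). □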

We next determine the magnitude of the added random
noise due to (10). In step 7 of Algorithm 1, we use $\sigma$ to
denote the random noise. $\sigma \in \mathbb{R}^n$ is a Laplace random
noise vector drawn independently according to the density
function:

$$\mathrm{Lap}(x \mid \mu) = \frac{1}{2\mu} \exp\left(-\frac{|x|}{\mu}\right), \tag{11}$$

where $\mu = S(t)/\epsilon$. We let $\mathrm{Lap}(\mu)$ denote the Laplace
distribution. (8) and (10) show that the magnitude of the
added random noise depends on the sensitivity parameters:
$\epsilon$, the stepsize $\alpha_t$, the dimensionality of vectors n, and the
bounded subgradient L.
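As a small illustration of how these parameters set the noise level, the sketch below computes the Laplace scale with the sensitivity replaced by its upper bound from (8), and draws one noise vector. Using the bound in place of the exact sensitivity, the stepsize 1/sqrt(t), and the helper names `laplace_scale` and `sample_noise` are assumptions of this sketch.

```python
import numpy as np

def laplace_scale(alpha_t, n, L, eps):
    """Scale mu = S(t)/eps with S(t) replaced by the bound 2*alpha_t*sqrt(n)*L of (8)."""
    return 2.0 * alpha_t * np.sqrt(n) * L / eps

def sample_noise(rng, alpha_t, n, L, eps):
    """Draw one noise vector with independent Laplace coordinates."""
    return rng.laplace(0.0, laplace_scale(alpha_t, n, L, eps), size=n)

rng = np.random.default_rng(0)
# With alpha_t = 1/sqrt(t) (an assumed schedule) the scale shrinks over time.
scales = [laplace_scale(1.0 / np.sqrt(t), n=4, L=1.0, eps=0.5) for t in (1, 100)]
noise = sample_noise(rng, alpha_t=0.1, n=4, L=1.0, eps=0.5)
print(scales)
```

A smaller privacy parameter or a larger dimension both inflate the scale, while a decaying stepsize reduces it round by round.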

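Lemma 2. Under Assumptions 1 and 2, at the t-th round,
the i-th online learner's output of $\mathcal{A}$, $\tilde{w}_t^i$, is ε-differentially
private.

Proof. Let $\tilde{w}_t^i = w_t^i + \sigma_t^i$ and $\tilde{w}_t^{i\prime} = w_t^{i\prime} + \sigma_t^i$. Then, by the
definition of differential privacy (see Definition 1), $\tilde{w}_t^i$ is
ε-differentially private if

$$\Pr[\tilde{w}_t^i \in W] \leq e^{\epsilon} \Pr[\tilde{w}_t^{i\prime} \in W]. \tag{12}$$

For $w \in W$, we obtain

$$\begin{aligned}
\frac{\Pr[\tilde{w}_t^i = w]}{\Pr[\tilde{w}_t^{i\prime} = w]}
&= \prod_{j=1}^{n} \exp\left( \frac{\epsilon \left( |w_t^{i\prime}[j] - w[j]| - |w_t^i[j] - w[j]| \right)}{S(t)} \right) \\
&\leq \prod_{j=1}^{n} \exp\left( \frac{\epsilon \, |w_t^{i\prime}[j] - w_t^i[j]|}{S(t)} \right) \\
&= \exp\left( \frac{\epsilon \, \|w_t^i - w_t^{i\prime}\|_1}{S(t)} \right) \\
&\leq \exp(\epsilon),
\end{aligned} \tag{13}$$

where the first inequality follows from the triangle inequality,
and the last inequality follows from (10). □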
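The density-ratio argument behind Lemma 2 can be checked numerically. In the sketch below, two parameter vectors stand in for the outputs computed from adjacent data sets, the Laplace scale is set from their L1 distance, and the log-ratio of the two output densities is evaluated at many random points. The vectors, the sample points and the helper `laplace_log_density` are illustrative assumptions.

```python
import numpy as np

def laplace_log_density(z, center, scale):
    """Log-density at z of a vector with independent Laplace(center[j], scale) coordinates."""
    return np.sum(-np.abs(z - center) / scale - np.log(2.0 * scale))

rng = np.random.default_rng(1)
eps, n = 0.5, 5
w = rng.standard_normal(n)                     # output under data set X
w_prime = w + rng.uniform(-0.1, 0.1, size=n)   # output under adjacent data set X'
sens = np.sum(np.abs(w - w_prime))             # L1 distance, playing the role of S(t)
scale = sens / eps                             # Laplace scale mu = S(t)/eps

# Largest observed log-ratio of the two output densities over random points z.
worst = max(
    laplace_log_density(z, w, scale) - laplace_log_density(z, w_prime, scale)
    for z in rng.standard_normal((1000, n)) * 3.0
)
print("largest log-ratio:", worst, "bound eps:", eps)
```

By the triangle inequality the log-ratio can never exceed eps at any point, so the sampled maximum stays below the bound.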

McSherry fl^ has proposed that the privacy guarantee
does not degrade across rounds as long as the samples used in
the rounds are disjoint. In Algorithm 1, at each round, each
learner is given a question $x_t^i$ and then makes the prediction
$w_t^i$. Finally, given the correct answer $y_t^i$, each learner
obtains the loss function $f_t^i(w) := \ell(w, x_t^i, y_t^i)$. In this
process, we regard $(x_t^i, y_t^i)$ as a sample. During the T rounds
of Algorithm 1, these samples are disjoint. Therefore, as
Algorithm 1 runs, the privacy guarantee does not degrade.
We then obtain the following theorem.

Theorem 1 (Parallel Composition). On the basis of
Definitions 1 and 3, under Assumption 1 and Lemma 2, our
DOLA (see Algorithm 1) is ε-differentially private.

Proof. This proof follows from Theorem 4 of ili]. The
probability of the output W (defined in Definition 1) is

$$\Pr[\mathcal{A}(X) \in W] = \prod_{t=1}^{T} \Pr[\mathcal{A}(X)_t \in W]. \tag{14}$$

Using the definition of differential privacy for each output
(see Lemma 2), we have

$$\begin{aligned}
\prod_{t=1}^{T} \Pr[\mathcal{A}(X)_t \in W]
&\leq \prod_{t=1}^{T} \Pr[\mathcal{A}(X')_t \in W] \times \prod_{t=1}^{T} \exp\left( \epsilon \times |X_t \oplus X'_t| \right) \\
&\leq \prod_{t=1}^{T} \Pr[\mathcal{A}(X')_t \in W] \times \exp\left( \epsilon \times |X \oplus X'| \right),
\end{aligned} \tag{15}$$

where $|X \oplus X'|$ denotes the different entry between X and
X'. □

Intuitively, the above inequality states that the ultimate
privacy guarantee is determined by the worst of the privacy
guarantees, not the sum $T\epsilon$.

Combining (8), (11) and Lemma 2, we find that if each
round of Algorithm 1 has the same level of privacy guarantee
(ε-differential privacy), the magnitude of the noise
decreases as Algorithm 1 runs. That is because the magnitude
of the noise depends on the stepsize $\alpha_{t+1}$, which decreases
as the subgradient descends.

4.2 Regret Analysis 

The regret of an online learning algorithm represents the sum
of the mistakes made by the learners during the
learning and predicting process. That means that if Algorithm 1
runs better and faster, the regret of our distributed online
learning algorithm will be lower. In other words, a faster
convergence rate ensures that the m learners make fewer
mistakes and predict more accurately. Hence, we bound the
regret $R_D$ through the convergence of $w_t^i$ in Algorithm 1.

To analyze the convergence of $w_t^i$, we consider the
behavior of the time-variant matrix $A_t$. Let $A_t$ be the matrix
whose (i, j)-th entry equals $a_{ij}(t)$ in Assumption 2. According to
the assumption, $A_t$ is doubly stochastic. As mentioned
previously, some related works have studied the matrix










IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. X, NO. X, JANUARY 20XX 




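convergence of $A_t$. For simplicity, we use one of these results
to obtain the following lemma.

Lemma 3 (Ql). We suppose that at each round t, the
matrix $A_t$ satisfies the description in Assumption 2. Then,
we have

(1) $\lim_{k \to \infty} \phi(k, s) = \frac{1}{m} e e^T$ for all $k, s \in \mathbb{Z}$ with $k \geq s$,
where

$$\phi(k, s) = A(k) A(k - 1) \cdots A(s + 1). \tag{16}$$

(2) Further, the convergence is geometric and the rate of
convergence is given by

$$\left| [\phi(k, s)]_{ij} - \frac{1}{m} \right| \leq \theta \beta^{k - s}, \tag{17}$$

where $\theta > 0$ and $0 < \beta < 1$ are constants determined by $\eta$, m and N.

Lemma 3 will be used repeatedly in the proofs of the
following lemmas. Next, we study the convergence of
Algorithm 1 in detail. We use the subgradient descent method to
move $w_t^i$ toward the theoretically optimal solution.
Based on this method, we know that $w_{t+1}^i$ is closer to the
optimal solution than $w_t^i$. Besides, we also want to know the
difference between two arbitrary learners, but computing
the norms $\|w_t^i - w_t^j\|$ makes no sense. Alternatively, we
study the behavior of $\|\bar{w}_t - w_t^i\|$, where for all t, $\bar{w}_t$ is
defined by

$$\bar{w}_t = \frac{1}{m} \sum_{i=1}^{m} w_t^i. \tag{18}$$

In the following lemma, we give the bound of
$\|\bar{w}_t - w_t^i\|$.

Lemma 4. Under Assumptions 1 and 2, for all $i \in \{1, \ldots, m\}$
and $t \in \{1, \ldots, T\}$, we have

$$\|\bar{w}_t - w_t^i\| \leq m L \theta \sum_{k=1}^{t-1} \beta^{t-k} \alpha_k + \theta \sum_{k=1}^{t-1} \beta^{t-k} \sum_{i=1}^{m} \|\sigma_k^i\| + 2 \alpha_t L. \tag{19}$$

Proof. For simplicity, we first study $\|\bar{w}_{t+1} - w_{t+1}^i\|$ instead.
Define

$$d_{t+1}^i = w_{t+1}^i - b_t^i, \tag{20}$$

where $b_t^i$ is defined in step 4 of Algorithm 1. We next
estimate the norm of $d_t^i$ for any t and i. According to the
well-known non-expansive property of the Euclidean projection
onto a closed and convex set W, for all $x \in W$, we have

$$\|\mathrm{Pro}[x]\| \leq \|x\|. \tag{21}$$

Based on (20) and (21), using the definition of $b_t^i$ and $g_t^i$ in
Algorithm 1, we obtain

$$\|d_{t+1}^i\| = \|\mathrm{Pro}[b_t^i - \alpha_{t+1} g_t^i] - b_t^i\| \leq \alpha_{t+1} \|g_t^i\| \leq \alpha_{t+1} L. \tag{22}$$

We use (3) in the last step.

We apply mathematical induction to (20) and use
the matrices $\phi(k, s)$ defined in (16). We then obtain

$$w_{t+1}^i = d_{t+1}^i + \sum_{k=1}^{t} \left( \sum_{j=1}^{m} [\phi(t+1, k)]_{ij} \, d_k^j \right) + \sum_{k=1}^{t} \left( \sum_{j=1}^{m} [\phi(t+1, k)]_{ij} \, \sigma_k^j \right). \tag{23}$$

Using (18) and (20), we rewrite $\bar{w}_{t+1}$ as follows:

$$\begin{aligned}
\bar{w}_{t+1} &= \frac{1}{m} \sum_{i=1}^{m} w_{t+1}^i \\
&= \frac{1}{m} \left( \sum_{j=1}^{m} \sum_{i=1}^{m} a_{ij}(t+1)(w_t^j + \sigma_t^j) + \sum_{i=1}^{m} d_{t+1}^i \right) \\
&= \frac{1}{m} \left( \sum_{j=1}^{m} \left( \sum_{i=1}^{m} a_{ij}(t+1) \right) (w_t^j + \sigma_t^j) + \sum_{i=1}^{m} d_{t+1}^i \right).
\end{aligned} \tag{24}$$

According to Assumption 2, we know $\sum_{i=1}^{m} a_{ij}(t + 1) = 1$,
and then simplify $\bar{w}_{t+1}$ as

$$\begin{aligned}
\bar{w}_{t+1} &= \frac{1}{m} \left( \sum_{j=1}^{m} (w_t^j + \sigma_t^j) + \sum_{i=1}^{m} d_{t+1}^i \right) \\
&= \bar{w}_t + \frac{1}{m} \sum_{i=1}^{m} (\sigma_t^i + d_{t+1}^i).
\end{aligned} \tag{25}$$

Finally, we have

$$\bar{w}_{t+1} = \frac{1}{m} \sum_{k=1}^{t} \sum_{i=1}^{m} \sigma_k^i + \frac{1}{m} \sum_{k=1}^{t+1} \sum_{i=1}^{m} d_k^i. \tag{26}$$

Using (23) and (26), we obtain

$$\begin{aligned}
\|\bar{w}_{t+1} - w_{t+1}^i\|
&= \left\| \frac{1}{m} \sum_{k=1}^{t} \sum_{j=1}^{m} \sigma_k^j + \frac{1}{m} \sum_{k=1}^{t+1} \sum_{j=1}^{m} d_k^j - \left( d_{t+1}^i + \sum_{k=1}^{t} \sum_{j=1}^{m} [\phi(t+1, k)]_{ij} \, d_k^j + \sum_{k=1}^{t} \sum_{j=1}^{m} [\phi(t+1, k)]_{ij} \, \sigma_k^j \right) \right\| \\
&= \left\| \sum_{k=1}^{t} \sum_{j=1}^{m} \left( \frac{1}{m} - [\phi(t+1, k)]_{ij} \right) (\sigma_k^j + d_k^j) + \left( \frac{1}{m} \sum_{j=1}^{m} d_{t+1}^j - d_{t+1}^i \right) \right\|.
\end{aligned} \tag{27}$$

According to the triangle inequality in Euclidean geometry,
we further have

$$\|\bar{w}_{t+1} - w_{t+1}^i\| \leq \sum_{k=1}^{t} \sum_{j=1}^{m} \left| \frac{1}{m} - [\phi(t+1, k)]_{ij} \right| \left( \|\sigma_k^j\| + \|d_k^j\| \right) + \frac{1}{m} \sum_{j=1}^{m} \|d_{t+1}^j\| + \|d_{t+1}^i\|. \tag{28}$$

Using the bound of $\|d_t^i\|$ in (22) and (17) in Lemma 3, we
have

$$\|\bar{w}_{t+1} - w_{t+1}^i\| \leq m L \theta \sum_{k=1}^{t} \beta^{t+1-k} \alpha_k + \theta \sum_{k=1}^{t} \beta^{t+1-k} \sum_{i=1}^{m} \|\sigma_k^i\| + 2 \alpha_{t+1} L. \tag{29}$$

Finally, we obtain (19) based on (29). □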
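The averaging identity (25), which drives the proof of Lemma 4, relies only on the columns of the communication matrix summing to one. The sketch below checks it numerically for one round. The specific doubly stochastic matrix (a convex combination of permutation matrices), the ball-shaped set W, and the random stand-ins for parameters, noise and subgradients are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, alpha, radius = 4, 3, 0.1, 1.0

# A doubly stochastic matrix built as a convex combination of permutation matrices.
eye = np.eye(m)
A = 0.5 * eye + 0.3 * eye[[1, 2, 3, 0]] + 0.2 * eye[[3, 0, 1, 2]]

w = rng.uniform(-0.5, 0.5, size=(m, n))        # current parameters w_t^i
sigma = rng.laplace(0.0, 0.05, size=(m, n))    # noise sigma_t^i
g = rng.uniform(-1.0, 1.0, size=(m, n))        # stand-in subgradients g_t^i

b = A @ (w + sigma)                            # step 4 of Algorithm 1
step = b - alpha * g
norms = np.maximum(np.linalg.norm(step, axis=1, keepdims=True), 1e-12)
w_next = step * np.minimum(1.0, radius / norms)   # projection onto the ball W
d = w_next - b                                 # definition (20)

lhs = w_next.mean(axis=0)                          # average parameter at t+1
rhs = w.mean(axis=0) + (sigma + d).mean(axis=0)    # right-hand side of (25)
print(np.abs(lhs - rhs).max())
```

The same check fails for a matrix that is only row stochastic, which is one way to see why double stochasticity is required in Assumption 2.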

Next we bound the distance ||wt+i — w\\'^ for an arbi¬ 
trary w G W. This bound together with Lemma 4 helps to 
analyze the convergence of our algorithm. 

In following Lemma 5, 6 and Theorem 2, we denote ft = 
E™ 1 ft for simplicity. 

Lemma 5. Under Assumption 1 and 2, for any w G W 
and for all t, we have 


|wt+i — w|| < (1 -F 2at+iL + 2L -\ - 

m ■“ 

t — 1 


-2A) ||wt - wll- {ft {wt) - ft {w)) 

m 


■ 4L-y 

m • ^ '' 


\Wt-Wt\ 


2=1 


EK + ^Ei) 


m ^ ^ 
2=1 


(30) 


Proof. For any w GW and all t, we use (25) fo have 



Based on 


lllFt+i - w\\ - ||wt - w\\ 

< (||wJt+i - w\\ - ||wt - w||) (llwt+i - w\\ + \\wt - w||) 

= ||wJt+i - wf - ||wt - wf, (32) 

we can transform (32) to the following inequality: 



Now we pay attention to 




w 


i=l 


9 _ \ 9 _ _ 

= --E (alwt -w) + -Yl(ai + < + y- w)]. 
2 = 1 2=1 

(34) 


First, we compute the inner product: 

^ m 

- - w). 

2=1 

Using (2) and (3) in Assumption 1, we first obtain 
- - w) 

= - {glwt - wl) - {gl,wl - w) 

< llfftll ll^t - w*|| -F ft\w) - fl{wl) - A \\wl - wll 
= II^Jll y* -wl\\ + fl{wt) - fliwl) - A \\wl - w|| 

+ /iH - fliwt) 

< llfftll IK - wl\\ + {gl,wt - wl) - AKJ-Wt|| 

- A \\wl - w|| -F fl{w) - fl{wt) 

<(NII + KII)IK--j|| 

- A IIWt - w|| -F fl{w) - fl{wt) 

< 2L IIWt - w|| - A ||wt - w|| - (/t (wi) - /t H) . (35) 

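Adding up the above inequality over $i = 1, \ldots, m$, we have

$$-\frac{2}{m}\sum_{i=1}^{m}\langle g_t^i, \bar{w}_t - w\rangle \le \frac{4L}{m}\sum_{i=1}^{m}\|\bar{w}_t - w_t^i\| - 2\lambda\|\bar{w}_t - w\| - \frac{2}{m}\left(f_t(\bar{w}_t) - f_t(w)\right). \tag{36}$$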

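Then, we compute the other inner product:

$$\begin{aligned}
\frac{2}{m}\sum_{i=1}^{m}\left\langle g_t^i + \sigma_t^i + d_{t+1}^i,\ \bar{w}_t - w\right\rangle
&\le \frac{2}{m}\sum_{i=1}^{m}\left\|g_t^i + \sigma_t^i + d_{t+1}^i\right\|\,\|\bar{w}_t - w\| \\
&\le \frac{2}{m}\sum_{i=1}^{m}\left(\|g_t^i\| + \|\sigma_t^i\| + \|d_{t+1}^i\|\right)\|\bar{w}_t - w\| \\
&\le \frac{2}{m}\sum_{i=1}^{m}\left(\alpha_{t+1}L + L + \|\sigma_t^i\|\right)\|\bar{w}_t - w\|.
\end{aligned} \tag{37}$$

In the last inequality, we use (3) and (16).

Combining (33)-(37), we complete the proof. □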


Based on Lemmas 4 and 5, we give the general regret bound in the following lemma. For simplicity, we let $f_t = \sum_{i=1}^{m} f_t^i$.

Lemma 6. We let $w^*$ denote the optimal solution computed in hindsight. The regret $R_D$ of Algorithm 1 is given by:


E [/‘K) - ftiw*)] 


7=1 


< 


(^RL -F 


1-/3 



at 


f ifiOmL 


2L-F 1 
2m 


EEI 

t=l i=l 


mR 


(38) 
Proof. We use (30) in Lemma 5, which contains the term $f_t(\bar{w}_t) - f_t(w)$, and set $w = w^*$. Then, we rearrange (30) to have

ftiwl) - ftiw*) 

= ft{wt) - ft{w*) + ft{w\) - ft{wt) 


< —(1 — 2A + ‘2ctt+\L + ‘2L H- \ I 

2 ^ m f ^ ' 


) ||wt -w 


T t-1 T t-1 m 

S 2 = “iOmf if EE /3* ^ak + WraL EE/^‘“'=E|| 

t=l k=l t=l k=l i=l ' 

T T T T ra 

< WrnL^ E-E + 30mL E/^^EE|H1I 

fc = l t = l i=l 


t = l k = l 


-^3^E“‘ + t3tEE ’ 


(43) 


+ 2iE ll^‘ “ '^tll - ^ Il^«t+1 - W*\ 


i=l 


+ 


m 


EK* + '^*+i) 


m ^ , 


+ mL llwt - wjll . (39) 


/tK) - Mw*) 


< —(1 — 2A + 2at+iL + 2L + — 

2 TYl ' ^ 


m 


\^t\\) \m-w 


2=1 


^ ll_ *1. 1 

- y ll».+i - » II + 

t-1 


EK + ^t 


t+i; 


6 atmL^ 


t-1 


m 

< — 

- 2 


^ ^ (1 — 2A + 2Q;t+iE + 2L H- ^ ^ (Jt 


\\wt — w 


^Si 


t=l £=1 


Plugging in the bound of $\|\bar{w}_t - w_t^i\|$ in Lemma 4, we rewrite (39) as




+^t+i 

2=1 ^ 

m 

(21/ + 1) j|(Tt I + 
2=1 
m 

{2L + 1) ^ jlcr) 


2 r2 

m L at 


Combining $S_1$, $S_2$ and $S_3$, we get (38).


(44) 


□ 


+ P'~'"ak + 36»mL ^ ^ ||a^||. (40) 

k=l k=l i=l 

Summing up (40) over $t = 1, \ldots, T$, we have

T 

E [/‘(w‘) - 


Lemma 6 gives the regret bound with respect to the stepsize $\alpha_t$ and the noise parameter $\sigma_t^i$. Further, we analyze the regret bounds for convex and strongly convex functions. Besides, we need to figure out the influence that the total noise has on the regret bounds.

Theorem 2. Based on Lemma 6, if $\lambda > 0$ and we set $\alpha_t = \frac{1}{\lambda t}$, then the expected regret of our DOLA satisfies:


.t = l J t=l 

mL / SpOmL 13 

/ 3/36lmL 2L+1\ 2V2mnL 


\ 1 — ^ 2 m 


\e 


(l + logT) 


mR 
2 ’ 
(45) 


and if $\lambda = 0$ and we set $\alpha_t = \frac{1}{\sqrt{t}}$, then


— ^ ||wt+i — ' 


+E 


2m 


m 

E (to + '^t+i) 


+ GatmL 


E 


eSi S2 

T £-1 T £-1 m 

+39irf EE /3* '^ak + WmL EEE''Elk^||- (41) 


E/*(^i) 

.t^l 
mL 


1-13 


t=i 

ipOmL 13 


L] Vf - - 
2 \ 2 


t=i fc=i i=i 


S3 


Recall that, in Assumption 1, $R$ is the upper bound of the diameter of $W$, and that $\alpha_{t+1} \le \alpha_t$. We compute (41) as follows


Si = ^ E II 2at+iL + 2L — 2A H- al 


/ 5/36mL 2L+1\ 2i/2mnL / /— 1 

V 1 — /3 2 m J e V 2 


Proof. First we consider $\alpha_t = \frac{1}{\lambda t}$; then


mR 

(46) 


4=2 

m 

T 


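$$\sum_{t=1}^{T}\alpha_t = \sum_{t=1}^{T}\frac{1}{\lambda t} = \frac{1}{\lambda}\sum_{t=1}^{T}\frac{1}{t} \le \frac{1}{\lambda}(1 + \log T). \tag{47}$$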




H —— ||ii)i — w II (1 + 2Qit+iL + 2L — 2A H-N ^£ 

2 " " ' m II I 

2=1 


m 


-^ ||iA)T+l — W* II 

S ^ E + 4 E ||.i|| 1 + - ™Rr(A - L) 


Since $\sigma_t^i$ is drawn from $\mathrm{Lap}(\mu)$ and each component of the vector $\sigma_t^i$ is independent, we have

T m T 

EElk‘|| = "*Eiitoii 


< f matL + al 


(42) 


= [i]E--+lto Mi^ 

£=1 

= mi/n^ y/\L\j]f, (48) 
where $\sigma_t[j]$ denotes an arbitrary component of the vector $\sigma_t$. Under the condition $\sigma_t[j] \sim \mathrm{Lap}(\mu)$, we have $\mathbb{E}\left[\sigma_t[j]^2\right] = 2\mu^2$; then


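$$\begin{aligned}
\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{m}\|\sigma_t^i\|\right]
&\le m\sqrt{2n}\sum_{t=1}^{T}\frac{S(t)}{\epsilon} \\
&\le \frac{2\sqrt{2}\,mnL}{\epsilon}\sum_{t=1}^{T}\alpha_t \\
&\le \frac{2\sqrt{2}\,mnL}{\epsilon\lambda}(1 + \log T).
\end{aligned} \tag{49}$$

The last inequality follows from (47).

Then, using (47) and (49), we get (45).

If $\lambda = 0$ and we set $\alpha_t = \frac{1}{\sqrt{t}}$, we have

$$\sum_{t=1}^{T}\alpha_t = \sum_{t=1}^{T}\frac{1}{\sqrt{t}} \le 2\sqrt{T} - 1. \tag{50}$$

Using (50), we rewrite (49) as

$$\mathbb{E}\left[\sum_{t=1}^{T}\sum_{i=1}^{m}\|\sigma_t^i\|\right] \le \frac{2\sqrt{2}\,mnL}{\epsilon}\left(2\sqrt{T} - 1\right). \tag{51}$$

Now, using (50) and (51), we get (46). □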
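The Laplace moment step in this proof can be checked numerically. The following Python sketch is not part of the original analysis: it draws vectors with i.i.d. $\mathrm{Lap}(\mu)$ components and compares the empirical mean of $\|\sigma\|$ with $\sqrt{2n}\,\mu$, the value obtained from the second moment $2\mu^2$ of each component. The function names, the sample size and the seed are illustrative assumptions.

```python
import math
import random


def laplace_vector(n, mu, rng):
    """Draw n i.i.d. Laplace(0, mu) components.

    The difference of two independent exponentials with mean mu
    is Laplace-distributed with scale mu.
    """
    return [rng.expovariate(1.0 / mu) - rng.expovariate(1.0 / mu)
            for _ in range(n)]


def mean_noise_norm(n, mu, trials, seed=0):
    """Monte Carlo estimate of E||sigma|| for an n-dimensional Laplace vector."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        vec = laplace_vector(n, mu, rng)
        total += math.sqrt(sum(c * c for c in vec))
    return total / trials


def second_moment_bound(n, mu):
    """sqrt(E||sigma||^2) = sqrt(n * 2 * mu^2) = sqrt(2n) * mu."""
    return math.sqrt(2.0 * n) * mu


if __name__ == "__main__":
    n, mu = 10, 0.5  # illustrative dimension and Laplace scale
    print("empirical mean norm:", mean_noise_norm(n, mu, trials=20000))
    print("sqrt(2n) * mu      :", second_moment_bound(n, mu))
```

By Jensen's inequality the empirical mean should sit slightly below $\sqrt{2n}\,\mu$, which is why the sketch treats that value as an upper bound rather than an exact expectation.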


As expected, Theorem 2 gives the square root regret $O(\sqrt{T})$ (for $\lambda = 0$) and the logarithmic regret $O(\log T)$ (for $\lambda > 0$) of Algorithm 1. Besides $T$, the regret bounds also depend on the size $m$ of the distributed network. More importantly, the total noise added to the outputs has magnitude of the same orders, $O(\sqrt{T})$ and $O(\log T)$. This means that guaranteeing differential privacy has no strong influence on the non-private DOLA. The reason is that the magnitude of the total noise depends on the stepsize $\alpha_t$ through (49), so it has a form similar to that of the non-private regret. Thus, the final regret bound with noise has the same order as the non-private regret bound.
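The growth of the total noise with $T$ can be illustrated with a few lines of Python. This is a sketch under stated assumptions rather than the authors' code: it assumes the step sizes $\alpha_t = 1/(\lambda t)$ and $\alpha_t = 1/\sqrt{t}$, and a noise bound of the form $(2\sqrt{2}\,mnL/\epsilon)\sum_t \alpha_t$. The values chosen for `m`, `n`, `L`, `eps` and `lam` are purely illustrative.

```python
import math


def step_sum(alpha, T):
    """Sum of step sizes alpha(1) + ... + alpha(T)."""
    return sum(alpha(t) for t in range(1, T + 1))


def total_noise_bound(m, n, L, eps, alpha, T):
    """(2*sqrt(2)*m*n*L / eps) * sum_t alpha_t, the assumed form of the noise bound."""
    return 2.0 * math.sqrt(2.0) * m * n * L / eps * step_sum(alpha, T)


if __name__ == "__main__":
    lam = 0.5                       # illustrative strong-convexity parameter
    m, n, L, eps = 4, 10, 1.0, 0.1  # illustrative network size, dimension, Lipschitz constant, privacy level
    for T in (10, 1000, 100000):
        log_case = total_noise_bound(m, n, L, eps, lambda t: 1.0 / (lam * t), T)
        sqrt_case = total_noise_bound(m, n, L, eps, lambda t: 1.0 / math.sqrt(t), T)
        print(T, round(log_case, 1), round(sqrt_case, 1))
```

The first column of results grows like $\log T$ and the second like $\sqrt{T}$, mirroring the two regret orders.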


5 APPLICATION TO PRIVATE DISTRIBUTED OFFLINE LEARNING USING MINI-BATCH

In Section 4, we proposed a differentially private DOLA with good regret bounds of $O(\sqrt{T})$ and $O(\log T)$. Kakade and Tewari and Jain et al. have both proposed that online learning algorithms with good regret bounds can be used to achieve fast convergence rates for offline learning algorithms. Based on their analysis, we exploit this application in distributed scenarios. Before that, we first discuss private distributed offline learning using mini-batch.

In distributed offline learning scenarios, we also assume that there are $m$ offline learners. Each learner can obtain the labelled examples (e.g., $(x_1^i, y_1^i), \ldots, (x_n^i, y_n^i)$) from its local data source. Unlike the distributed online learners, the offline learners have the data beforehand. Before we describe the distributed offline learning model, we should pay attention to how the centralized offline learning model works.

In a centralized offline learning model, the classical method of training such a model based on labelled data is by optimizing the following problem:

$$w^* = \arg\min_{w} \frac{1}{n}\sum_{k=1}^{n}\ell(w, x_k, y_k) + \frac{\nu}{2}\|w\|^2, \tag{52}$$

where $\ell$ is a convex loss function. Different choices of $\ell$ in machine learning yield different data mining algorithms. For example, the Support Vector Machine (SVM) algorithm comes from $\ell(w, x, y) = \max\left(1 - y w^\top x, 0\right)$ and the Logistic Regression algorithm comes from $\ell(w, x, y) = \log\left(1 + \exp\left(-y w^\top x\right)\right)$. For solving the problem in (52), stochastic gradient descent (SGD) was proposed. SGD updates the iterate at round $t$ as:

$$w_{t+1} = w_t - \alpha_{t+1}\left(\nabla\ell(w_t, x_t, y_t) + \nu w_t\right), \tag{53}$$

where this iterate is updated based on a single point $(x_t, y_t)$ sampled randomly from the local data set.
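As an illustration of the single-point update, here is a minimal Python sketch of one regularized SGD step for the hinge loss. It is a sketch under assumptions, not the authors' implementation: the names `hinge_subgradient`, `sgd_step` and the regularization coefficient `nu` are hypothetical, and the subgradient $-y x$ is used whenever the margin $y\,w^\top x$ is below 1.

```python
def hinge_subgradient(w, x, y):
    """A subgradient of max(1 - y * <w, x>, 0) with respect to w."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    if margin < 1.0:
        return [-y * xi for xi in x]
    return [0.0] * len(w)


def sgd_step(w, x, y, alpha, nu):
    """One step w - alpha * (subgradient + nu * w) on a single example (x, y)."""
    g = hinge_subgradient(w, x, y)
    return [wi - alpha * (gi + nu * wi) for wi, gi in zip(w, g)]


if __name__ == "__main__":
    # The example violates the margin, so the step moves w towards y * x.
    print(sgd_step([0.0, 0.0], [1.0, 2.0], 1, alpha=0.5, nu=0.1))
```

When the margin is already at least 1, only the regularization term shrinks the iterate.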

Next, based on the centralized offline learning model, we build the distributed offline learning model. In the distributed model, each learner updates its parameter with a subgradient as (53) does. Meanwhile, each learner must exchange information with other learners. Hence, for distributed offline learning we update the iterate as:

m 

wj+i = E “y (^t + vw"t) ■ (54) 

In the offline learning framework, all data are available beforehand. To handle such massive training points, we use SGD with mini-batch to update the iterate. Using mini-batch, we update the iterate at round $t$ on the basis of a subset $H_t$ of examples. This helps us process multiple sampled examples instead of a single one at each round. Under this model, our offline learning algorithm runs in a parallel and distributed manner. Based on mini-batch, we rewrite (54) as:

m 

Wt+I = E “b + 1) E 9k, (55) 

J=i {xk,yk)eHt 

where h denotes the number of examples included in Ht 
and Wf is defined in Lemma 2. In (55), we compute an 
average of subgradients of h examples sampled i.i.d. from 
the local data source. 

As with the DOLA, exchanging information also leads to a privacy breach in distributed offline learning. Hence, to protect privacy, we make our distributed offline learning algorithm guarantee $\epsilon$-differential privacy as well. The differentially private method used here is the same as that used in Algorithm 1. Furthermore, mini-batch can weaken the influence of noise on regret bounds when the algorithm guarantees differential privacy. For example, Song et al. demonstrated that a differentially private SGD algorithm updated with a single point has high variance and used mini-batch to reduce the variance. In this paper, we also use mini-batch to achieve the same goal.

To conclude, we propose a private distributed offline learning algorithm using mini-batch. The algorithm is summarized in Algorithm 2.


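Algorithm 2 Differentially Private Distributed Offline Learning Using Mini-Batch

1: Input: Cost functions $f_t^i(w) := \ell(w, x_t^i, y_t^i)$, $i \in [1, m]$ and $t \in [0, T]$; initial points $w_0^1, \ldots, w_0^m$; doubly stochastic matrix $A_t = (a_{ij}(t)) \in \mathbb{R}^{m \times m}$; maximum iterations $T/h$.
2: for $t = 0, \ldots, T/h$ do
3:   for each learner $i = 1, \ldots, m$ do
4:     $b_t^i = \sum_{j=1}^{m} a_{ij}(t+1)\left(w_t^j + \sigma_t^j\right)$, where $\sigma_t^j$ is a Laplace noise vector in $\mathbb{R}^n$
5:     $g_k \in \partial\ell(b_t^i, x_k, y_k)$, which is computed based on examples $(x_k, y_k) \in H_t$
6:     $w_{t+1}^i = \mathrm{Pro}\left[b_t^i - \alpha_{t+1}\left(\nu w_t^i + \frac{1}{h}\sum_{(x_k, y_k) \in H_t} g_k\right)\right]$ (Projection onto $W$)
7:     broadcast the output $\left(w_{t+1}^i + \sigma_{t+1}^i\right)$
8:   end for
9: end for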


5.1 Privacy analysis for Algorithm 2 

Algorithm 2 guarantees the same level of privacy as Algorithm 1 does. Unlike Algorithm 1, step 6 in Algorithm 2 computes an average of subgradients. According to the analysis of the sensitivity in Section 4.1, the sensitivity of Algorithm 2 must be different from (8). We compute the new sensitivity of Algorithm 2 in the following lemma.

Lemma 7 (Sensitivity of Algorithm 2). Under Assumption 1, with all previous definitions in force, the $L_1$-sensitivity of Algorithm 2 is

$$S_2(t) \le \frac{2\alpha_t\sqrt{n}L}{h}. \tag{56}$$

We omit the proof of Lemma 7, which follows along the lines of Lemma 1.

Lemma 7 demonstrates that, besides the parameters in (6), the magnitude of the sensitivity of Algorithm 2 depends on the batch size $h$. Comparing (56) with (8), we find that the sensitivity of Algorithm 2 is smaller than that of Algorithm 1. (11) shows that lower sensitivity leads to less added noise. So Algorithm 2 can add less random noise to its output while it guarantees the same level of privacy as Algorithm 1.
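The effect of the batch size on the amount of noise can be made concrete with a short Python sketch. It assumes a sensitivity of the form $2\alpha_t\sqrt{n}L/h$, with $h = 1$ standing in for the single-point case, and the standard Laplace mechanism scale $\mu = S(t)/\epsilon$. The function names and numerical values are illustrative only.

```python
import math


def sensitivity(alpha_t, n, L, h=1):
    """Assumed L1-sensitivity 2 * alpha_t * sqrt(n) * L / h; h = 1 means no mini-batch."""
    return 2.0 * alpha_t * math.sqrt(n) * L / h


def laplace_scale(alpha_t, n, L, eps, h=1):
    """Scale of the Laplace noise needed for eps-differential privacy."""
    return sensitivity(alpha_t, n, L, h) / eps


if __name__ == "__main__":
    single = laplace_scale(alpha_t=0.1, n=10, L=1.0, eps=0.1)
    batched = laplace_scale(alpha_t=0.1, n=10, L=1.0, eps=0.1, h=5)
    print("scale without mini-batch:", single)
    print("scale with h = 5        :", batched)
```

Under these assumptions the noise scale shrinks by exactly the factor $h$ at the same privacy level $\epsilon$.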

As in Lemma 2, we also ensure that the output of Algorithm 2 guarantees $\epsilon$-differential privacy at each round $t$. Then, we consider the following lemma.

Lemma 8. At the $t$-th round, the $i$-th learner's output of Algorithm 2 is $\epsilon$-differentially private.

The proof follows along the lines of Lemma 2, and is omitted.

To recall, we use mini-batch to reduce the variance. We divide the dataset into batches $H_1, \ldots, H_{T/h}$, which are disjoint subsets. According to the theory of parallel composition in differential privacy, the privacy guarantee does not degrade across rounds. Based on this observation, we obtain the following theorem, whose proof we omit.

Theorem 3. Using Lemma 8 and the theory of parallel composition, Algorithm 2 is $\epsilon$-differentially private.

5.2 Utility analysis for Algorithm 2 

As described, we next use the regret bounds of Algorithm 1 to achieve fast convergence rates for Algorithm 2. Note that the following Lemmas 9 and 10 prepare for the final result, Theorem 4.

For a clear description, we first consider centralized offline learning. Let $X$ be the domain of samples $x_t$ and let $D_X$ denote a distribution over the domain $X$. Instead of minimizing (1), we bound

$$F(\hat{w}) - \min_{w \in W} F(w), \tag{57}$$

where $F(w) = \mathbb{E}[f(w, x, y)]$, $(x, y) \sim D_X$ and $\hat{w} = \frac{1}{T}\sum_{t=1}^{T} w_t$. Then, we obtain the centralized approximation error in the following lemma.

Lemma 9. Under Assumption 1, let $R_C$ be the regret (e.g., $R_C \le \log T$) of the centralized online learning algorithm. Then, with probability $1 - 4\gamma\ln T$,

F{w)-F{w*) 

^Rc , , 'j In ( 1 / 7 ) 

“T V A T lA’J T’ 

(58) 

where $w^* \in \arg\min_{w \in W} F(w)$.

Intuitively, Lemma 9 relates the online regret to the offline convergence rate. To obtain a similar lemma when the iterate is updated as in (55), we must know the new online regret using mini-batch. Dekel et al. demonstrated that the mini-batch update does not improve the regret but also does not significantly hurt the update rule. Based on their analysis, we obtain

$$R_{cmb} \le hR_C, \tag{59}$$

where $R_{cmb}$ denotes the centralized regret with mini-batch and $h$ is the size of $H_t$.

Lemma 10. Under Assumption 1, for the centralized offline learning update with mini-batch, if we update the iterate as (55), then with probability $1 - 4\gamma\ln T$, we have

Fmb (ui) Fmb (ui ) 

h'^Rc , II^ In (T/h) hy/hR^ 

f 16L^ ) /iln ( 1 / 7 ) 

+ max|^, 6 |-^ (60) 

Proof. Substituting $T/h$ (see step 2 in Algorithm 2) for $T$ in (58) and using $R_{cmb} \le hR_C$, we obtain (60). □

Lemma 10 is the utility analysis for the centralized model, while Algorithm 2 is a distributed offline learning algorithm using mini-batch. Next, we analyze the utility of the distributed model on the basis of Lemma 10. Similarly, we shall use the regret of Algorithm 1 to achieve the fast convergence rate for Algorithm 2.
[Figure 1 panels: (a) Synthetic data with nodes = 64; (b) RCV1 data with nodes = 64; (c) Synthetic data with $\epsilon = 0.1$; (d) RCV1 data with $\epsilon = 0.1$. x-axis: number of iterations.]

Fig. 1. (a) and (b): Regret vs Privacy on synthetic and RCV1 datasets, (c) and (d): Regret vs Nodes on synthetic and RCV1 datasets. Note that the y-axis denotes the average regret (normalized by the number of iterations).

[Figure 2 panels: (a) Synthetic data with size = 1; (b) Synthetic data with size = 5; (c) RCV1 data with size = 1; (d) RCV1 data with size = 5. x-axis: number of iterations.]

Fig. 2. (a) and (b): Regret vs Batch size on synthetic dataset, (c) and (d): Regret vs Batch size on RCV1 dataset. Note that this figure shows the variance and mean of the average regret (normalized by the number of iterations).

Theorem 4 (Utility of Algorithm 2). Under Assumption 1, the regret $R_D$ of Algorithm 1 can be used to achieve the convergence rate for Algorithm 2. Then, with probability $1 - 4\gamma\ln T$, we have

Rdmh ) Rdmb ^ 

^ h^Ro . I In {T/h) hy^hRu/m 
- mT V T 

f 16L^ I /i In ( 1 / 7 ) 

+ max|^,6|-(61) 

Tjh 

where Fdmb (w) = Edmb [/ {w\ x, 

' t=i 

Proof. We estimate the convergence rate with respect to an arbitrary learner $i$. So we use the regret of a single learner, $R_D/m$. Based on (60), we substitute $R_D/m$ for $R_C$, then obtain (61). □

Based on (3l and 0, we study the application of regret 
bounds to offline convergence rates in distributed scenarios. 
Our work also have the same three significant advantages 
in j2]- Except for these existing advantages, we find new 
advantages in distributed scenarios: 1) the corresponding 
algorithms converge faster; 2) guaranteeing the same level 
of privacy needs less noise; 3) the noise of same magnitude 
has less influence on the utility of algorithms. 

6 SIMULATIONS 

In this section, we conduct two sets of simulations. One studies the privacy and regret trade-offs for our DOLA. The other illustrates how well mini-batch reduces the high variance of differential privacy in the offline learning algorithm. For our implementations, we use the hinge loss function $f_t^i(w) = \max\left(1 - y_t^i\langle w, x_t^i\rangle, 0\right)$, where $\{(x_t^i, y_t^i) \in \mathbb{R}^n \times \{\pm 1\}\}$ are the data available only to the $i$-th learner. For fast convergence rates, we set the learning
rate nt = ^- Furthermore, we do experiments on both 
synthetic and real datasets. The synthetic data are generated from a unit ball of dimensionality $d = 10$. We generate a total of 100,000 labeled examples. The real data used in our simulation is a subset of the RCV1 dataset. For a sharp contrast, this subset has the same number of examples as the synthetic data. As shown in Algorithms 1 and 2, the dataset is divided into $m$ subsets. Each node updates the parameter based on its own subset and timely exchanges its updated parameter with its neighbors. Note that at round $t$, the $i$-th learner must exchange the parameter $w_t^i$ in strict accordance with Assumption 2. For a good observation, we sum the normalized error bounds (i.e., the "Regret" on the y-axis) for both Figures 1 and 2.
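The synthetic setup can be approximated with the Python sketch below. It is only an illustration: the text states that points come from a unit ball with $d = 10$ and that the data are divided into $m$ subsets, while the labelling rule used here (the sign of an inner product with a hidden vector `w_true`), the round-robin split and all function names are assumptions made for the example.

```python
import math
import random


def sample_unit_ball(d, rng):
    """Uniform point in the d-dimensional unit ball: Gaussian direction, radius U**(1/d)."""
    direction = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(c * c for c in direction))
    radius = rng.random() ** (1.0 / d)
    return [radius * c / norm for c in direction]


def make_dataset(count, d, seed=0):
    """Generate (x, y) pairs with x in the unit ball and y in {+1, -1}."""
    rng = random.Random(seed)
    w_true = sample_unit_ball(d, rng)  # hypothetical labelling hyperplane
    data = []
    for _ in range(count):
        x = sample_unit_ball(d, rng)
        y = 1 if sum(a * b for a, b in zip(w_true, x)) >= 0 else -1
        data.append((x, y))
    return data


def split_among_learners(data, m):
    """Round-robin split into m disjoint local subsets."""
    return [data[i::m] for i in range(m)]


if __name__ == "__main__":
    parts = split_among_learners(make_dataset(1000, 10), 4)
    print([len(p) for p in parts])
```

With 100,000 examples and $m = 64$, the same two calls would give each node a local subset of roughly 1,560 examples.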

Figure 1 (a) and (b) show the average regret (normalized by the number of iterations) incurred by our DOLA for different levels of privacy $\epsilon$ on synthetic and RCV1 datasets. Our differentially private DOLA has low regret even for a fairly high level of privacy (e.g., $\epsilon = 0.01$). The non-private algorithm has the lowest regret, as expected. More significantly, the regret gets closer to the non-private regret as the privacy preservation becomes weaker. Figure 1 (c) and (d) show the average regret for different numbers of nodes of the online system at the same level of privacy. Clearly, the centralized online learning algorithm (node = 1) has the lowest regret at the level of privacy $\epsilon = 0.1$, and the regret gets lower as the number of nodes gets smaller. Furthermore, the regret on synthetic data is lower than that on real data under the same conditions.

Figure 2 (a) and (b) show the average regret for different batch sizes on synthetic data. When the batch size is one (see Figure 2 (a)), the differentially private regret has higher variance than the non-private regret. However, a modest batch size of 5, as shown in Figure 2 (b), reduces the variance of our differentially private distributed offline learning algorithm. The mini-batch technique makes the variance of the differentially private distributed offline learning algorithm nearly identical to that of the non-private offline algorithm. Figure 2 (c) and (d) show the same simulation on RCV1 data and lead to the same conclusion as Figure 2 (a) and (b).


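TABLE 1

Method                         Nodes   Accuracy
Non-private                    1       82.51%
Non-private                    4       74.64%
Non-private                    64      65.72%
Private, $\epsilon = 1$        1       82.51%
Private, $\epsilon = 1$        4       74.64%
Private, $\epsilon = 1$        64      65.72%
Private, $\epsilon = 0.1$      1       80.17%
Private, $\epsilon = 0.1$      4       70.86%
Private, $\epsilon = 0.1$      64      62.34%
Private, $\epsilon = 0.01$     1       75.69%
Private, $\epsilon = 0.01$     4       64.81%
Private, $\epsilon = 0.01$     64      50.36%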


As we know, the hinge loss $\ell(w) = \max\left(1 - y w^\top x, 0\right)$ leads to the data mining algorithm SVM. To be more persuasive, we implement a differentially private distributed SVM and test this algorithm on RCV1 data. Table 1 shows the accuracy for different levels of privacy and different numbers of nodes. The centralized non-private model has the highest accuracy, 88.74%, while the model with 64 nodes at a high level of privacy ($\epsilon = 0.01$) has the lowest accuracy, 50.36%. Further, we conclude that the accuracy gets higher as the level of privacy gets lower or the number of nodes gets smaller. This conclusion agrees with Figures 1 and 2.

7 CONCLUSION AND DISCUSSION 

We have proposed a differentially private distributed online learning algorithm. We used subgradients to update the learning parameter and a random doubly stochastic matrix to guide the learners' communication with each other. More importantly, our network topology is time-variant. As expected, we obtained regret bounds of order $O(\sqrt{T})$ and $O(\log T)$. Interestingly, the magnitude of the total noise added to guarantee $\epsilon$-differential privacy also has the order of $O(\sqrt{T})$ and $O(\log T)$, matching the non-private regret.

Furthermore, we used our private distributed online learning algorithm with good regret bounds to solve private distributed offline learning problems. In order to reduce the high variance of our differentially private algorithm, we used the mini-batch technique to weaken the influence of added noise. This method lets the algorithm guarantee the same level of privacy using less random noise.

In this paper, we did not take delay into consideration. In distributed online learning scenarios, there must exist delays among the nodes when they communicate with each other. These delays are hard to analyze because each node has a different delay according to its communication graph, and the graph is even time-variant. In future work, we hope to present distributed online learning with delay.